1. Introduction

Research in automatic parallelization of loop-centric programs 
started with static analysis, then broadened its arsenal to include 
dynamic inspection-execution and speculative execution, the best 
results involving hybrid static-dynamic schemes. Beyond the detec- 
tion of parallelism in a sequential program, scalable parallelization 
on many-core processors involves hard and interesting parallelism 
adaptation and mapping challenges. These challenges include tai- 
loring data locality to the memory hierarchy, structuring indepen- 
dent tasks hierarchically to exploit multiple levels of parallelism, 
tuning the synchronization grain, balancing the execution load, de- 
coupling the execution into thread-level pipelines, and leveraging 
heterogeneous hardware with specialized accelerators. 

The polyhedral framework makes it possible to model, construct and
apply very complex loop nest transformations addressing most of
the parallelism adaptation and mapping challenges. But apart 
from hardware-specific, back-end oriented transformations (if- 
conversion, trace scheduling, value prediction), loop nest optimiza- 
tion has essentially ignored dynamic and speculative techniques. 
Research in polyhedral compilation recently reached a significant 
milestone towards the support of dynamic, data-dependent control 
flow. This opens a large avenue for blending dynamic analyses and 
speculative techniques with advanced loop nest optimizations. Se- 
lecting real-world examples from SPEC benchmarks and numerical 
kernels, we make a case for the design of synergistic static, dynamic 
and speculative loop transformation techniques. We also sketch the 
embedding of dynamic information, including speculative assump- 
tions, in the heart of affine transformation search spaces. 

2. Experimental Study 

We consider four motivating benchmarks, illustrating three combi- 
nations of dynamic analyses and loop transformations. 
Our experiments target three multicore platforms: 

• 2-socket quad-core Intel Xeon E5430, 2.66GHz, 16GB RAM (8 cores);

• 4-socket quad-core AMD Opteron 8380, 2.50GHz, 64GB RAM (16 cores);

• 4-socket hexa-core Intel Xeon E7450, 2.40GHz, 64GB RAM (24 cores).

We use OpenMP as the target of automatic and manual transfor- 
mations. Baseline and optimized codes were compiled with Intel's 
compiler ICC 11.0, with options -fast -parallel -openmp.



2.1 Dynamic techniques may be neither necessary nor 
profitable 

The SPEC CPU2000 183.equake and 179.art benchmarks have
frequently been used to motivate dynamic parallelization techniques.
We show that static transformation and parallelization techniques
can easily be extended to handle the limited degree of
data-dependent behavior in these programs.

Figure 1 shows the smvp() function of equake, well known for
its "sparse" reduction pattern (a histogram computation). The value
of col is read from an array; it is not possible to establish at
compilation time whether and when dependences will occur upon
accumulating on w[col][0]. Zhuang et al. [14] used automatically
generated inspection slices to parallelize this loop. The inspector 
slice is a simplified version of the original loop to anticipate the 
detection of dynamic dependences. In the case of equake, it com- 
putes the values of col within a sliding window of loop iterations 
to detect possible conflicts and build a safe schedule at run-time. 

Speculation has also been used to handle unpredictable memory 
accesses in equake. Oancea et al. [7] implemented a speculative
system to spot conflicts at runtime. When a thread detects a de- 
pendence violation, it kills other speculative threads and rolls back. 
If the number of rollbacks exceeds 1%, the execution proceeds in 
serial mode. This approach is similar to [6], which uses transactional
memory to implement thread-level speculation to parallelize
equake. Speculation is an interesting solution for dynamic
parallelization, but it incurs a high overhead from memory access tracing,
dependence checking, and rollback and/or commit operations.

Interestingly, in the case of equake, one may avoid inspection 
and speculation altogether. It is sufficient to enforce atomic execution
of the sparse reduction to w[col][0]. This can be done with
hardware atomic instructions. An alternative is to privatize the w 
array to implement a conflict-free parallel reduction. This induces 
some overhead to scan the private arrays (as many as concurrent 
threads) and sum up the partial accumulation results. 
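The two alternatives above can be sketched on a simplified scatter-add, which is not the equake source itself. The function names, the flattened one-dimensional w array, and the explicit nthreads parameter are assumptions made for illustration only; the OpenMP pragmas are ignored when OpenMP is not enabled, in which case the code runs sequentially.

```c
#include <stdlib.h>

/* Variant 1: each update of the sparse reduction is made atomic. */
void scatter_add_atomic(double *w, const int *col, const double *val, int n)
{
    #pragma omp parallel for
    for (int e = 0; e < n; e++) {
        #pragma omp atomic
        w[col[e]] += val[e];
    }
}

/* Variant 2: privatization. Each of the nthreads chunks accumulates into
 * its own private copy of w (length m); the copies are then scanned and
 * summed into w. Returns 0 on success, -1 if allocation fails. */
int scatter_add_privatized(double *w, int m, const int *col,
                           const double *val, int n, int nthreads)
{
    double *priv = calloc((size_t)nthreads * m, sizeof *priv);
    if (!priv)
        return -1;

    #pragma omp parallel for
    for (int t = 0; t < nthreads; t++) {
        int lo = (int)((long long)n * t / nthreads);
        int hi = (int)((long long)n * (t + 1) / nthreads);
        for (int e = lo; e < hi; e++)
            priv[(size_t)t * m + col[e]] += val[e];  /* conflict-free */
    }

    /* Merge step: the overhead mentioned in the text. */
    #pragma omp parallel for
    for (int c = 0; c < m; c++) {
        double s = 0.0;
        for (int t = 0; t < nthreads; t++)
            s += priv[(size_t)t * m + c];
        w[c] += s;
    }

    free(priv);
    return 0;
}
```

The merge step costs O(m × nthreads) regardless of how many updates were performed, which is why the relative merit of the two variants depends on the size of w.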

In the case of art, atomic execution of the trailing part of the
match() function is also sufficient to make an outer loop parallel;
see Figure 2. Since we are also dealing with a reduction, the
privatization alternative applies as well.

Figure 3 compares the speedup results of static loop transformation
vs. speculative conflict management with Intel's McRT Software
Transactional Memory (STM) [12]. We run the full benchmark
programs on their ref dataset. For equake, the static version
uses hardware atomic instructions. The STM version fails
to deliver any speedup while the version with hardware atomic
instructions scales reasonably well.¹ For art, the static version uses
privatization. The critical section is executed rarely and the grain
' As already pointed out in jS]- 



for (i=0; i<nodes; i++) {
  Anext = Aindex[i];
  Alast = Aindex[i+1];

  sum0 = A[Anext][0][0]*v[i][0] +
         A[Anext][0][1]*v[i][1] +
         A[Anext][0][2]*v[i][2];

  Anext++;

  while (Anext<Alast) {
    col = Acol[Anext];

    sum0 += A[Anext][0][0]*v[col][0] +
            A[Anext][0][1]*v[col][1] +
            A[Anext][0][2]*v[col][2];

    // Sparse reduction
    w[col][0] += A[Anext][0][0]*v[i][0] +
                 A[Anext][1][0]*v[i][1] +
                 A[Anext][2][0]*v[i][2];

    Anext++;
  }

  w[i][0] += sum0;
}



Figure 1. equake, core of the smvp() function

if (match_confidence > highest_confidence[winner]) {
  highest_confidence[winner] = match_confidence;
  set_high[winner] = TRUE;
}



Figure 2. art, end of the match() function

of parallelism is much bigger, which allows the STM version to 
yield some speedups although the statically privatized version still 
performs better. 

We also conducted experiments with different datasets. In the 
case of equake, it has a tremendous impact on the relative perfor- 
mance of the static privatization and hardware atomic versions, as 
shown in Figure 4. With the smaller train dataset, the privatization
version outperforms the hardware atomic version because the pri- 
vate arrays fit in the cache and privatization removes all contention 
on the hardware atomic instructions. As a side effect, this result 
advocates for an adaptive compilation scheme generating multiple 
versions and a decision tree to dynamically select the most appro- 
priate version depending on program behavior and/or on features 
of the input data. 
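As a hedged illustration of such a decision, the sketch below selects between the two equake variants from the memory footprint of the private arrays. The heuristic, the function name and all parameters are hypothetical; the experiments above only suggest that cache residency of the private copies is one relevant feature.

```c
#include <stddef.h>

enum reduction_version { USE_ATOMIC, USE_PRIVATIZED };

/* Hypothetical selector: prefer privatization when all private copies of
 * the reduction array (one per thread) fit in the available cache,
 * otherwise fall back to hardware atomic updates. */
enum reduction_version choose_reduction_version(size_t elems, int nthreads,
                                                size_t cache_bytes)
{
    size_t private_bytes = elems * sizeof(double) * (size_t)nthreads;
    return private_bytes <= cache_bytes ? USE_PRIVATIZED : USE_ATOMIC;
}
```

A generated program would call such a selector once before entering the parallel region and branch to the matching code version.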

For both benchmarks, we expect more complex loop transfor- 
mations like loop tiling to further improve scalability. But we could 
not yet find a tool to automate the process. Instead of pursuing the 
manual transformation exploration, we prefer to test this hypothesis 
on another set of benchmarks more amenable to automatic paral- 
lelization with classical loop transformation tools. This will be the 
subject of the next section. 





          Static only              STM
Cores     8      16     24      8      16     24
equake    2.71   6.51   7.18    0.22   0.34   0.24
art       3.69   4.29   4.26    3.60   3.69   3.98

Figure 3. Speedups for equake and art



2.2 Complex loop transformations on data-dependent 
control flow 

Figure 4. Speedups for variants of equake on train and ref
datasets

Dynamic inspection and speculation are well suited to
equake, but we showed that resorting to dynamic analysis is not
strictly necessary to achieve good performance. Moreover, it is not
always possible to generate lightweight instrumentation slices or
profitable speculation schemes in general [9]. A good example
where we cannot extract an efficient inspector is the Givens rotation
kernel in Figure 5. It features two nested data-dependent
conditions to distinguish between different complex sine/cosine
computations. These conditions prevent optimization and parallelization
in classical frameworks restricted to affine conditional
expressions and loop bounds [2, 5].

There are loop-carried dependences from and to all three condi- 
tional branches. These are flow dependences and cannot be elimi- 
nated by array expansion (privatization, renaming) techniques. No 
dynamic parallelism detection method alone can find scalable par- 
allelism on this example: it may only extract parallelism in the inner 
loops (which can also be extracted with static techniques, but does 
not bring significant performance benefit). The loop nest must be 
transformed to express coarser grain parallelism at the outer loop 
level. This is the specialty of affine transformations in the polyhedral
framework: here, a composition of privatization, loop skewing
and loop tiling would be possible [5]. The question is how to automate
this transformation.

Fortunately, profiling the kernel shows that the third (else) 
branch is almost always executed on dense matrices. With the 
assumption that the third branch is almost always executed, one 
may speculatively ignore the dependences arising from the first 
two branches. Even better, one may virtually eliminate the if 
conditionals from the loop nest, yielding a static control loop nest. 
With these assumptions, PLUTO [2] is able to tile the loop nest,
which greatly enhances the scalability of the parallelization. The
first part of the result is shown in Figure 6.² As announced earlier, it
is more complex than tiling: the loop nest also needs to be skewed
to allow the outer loops to be permuted. This transformation is
always correct, even when the control flow takes one of the two
cold branches. It happens to preserve the dependences arising from
the two cold branches as well. We are in an ideal situation where the
speculative assumption offers extra flexibility in applying complex
transformations, but does not incur any runtime overhead. The end
result is a 7.02× speedup on 8 cores for 5000 × 5000 matrices; see
Figure 8.

Interestingly, the fact that dependences are compatible with a
composition of loop skewing and loop tiling can also be captured
with conservative, purely static methods, as demonstrated by
Benabderrahmane et al. [1], resulting in the exact same code.

2.3 Dynamic techniques helping loop transformations

In the previous experiments, different static methods were always
capable of extracting scalable parallelism. Figure 7 shows the forward
elimination step of the Gauss-J kernel, a Gauss-Jordan elimination
algorithm looking for zero diagonal elements at each elimination
step. Pivoting is the main source of data-dependent control
flow.

Just like the Givens rotation kernel, a combination of skewing 
and tiling is required to achieve the best performance. Static analy- 



² floord(n, d) and ceild(n, d) implement n/d and (n + d - 1)/d
respectively, where / is the Euclidean division and not the truncating integer
division of C and most ISAs.
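The distinction matters because the skewed loop bounds of Figure 6 evaluate these helpers on negative operands. A minimal sketch of both helpers follows, assuming a positive divisor d (for which Euclidean and floor division coincide); this is an illustrative definition, not necessarily the one emitted by the code generator.

```c
/* Floor division for d > 0 and any sign of n. */
int floord(int n, int d)
{
    int q = n / d;                    /* C truncates toward zero */
    return (n % d < 0) ? q - 1 : q;   /* adjust when truncation rounded up */
}

/* Ceiling division for d > 0, expressed as in the footnote. */
int ceild(int n, int d)
{
    return floord(n + d - 1, d);
}
```

For instance, floord(-1, 32) is -1 whereas the C expression -1 / 32 is 0, which would shift the lower bound of a tile loop by one block.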



for (k=0; k<N; k++) {
  for (i=0; i<M-1-k; i++) {
    if (A_r[i+1][k] == 0.0 && A_i[i+1][k] == 0.0) {
      // Data-dependent condition, rarely executed
      for (j=k; j<N; j++) {
        t1_r = A_r[i+1][j];
        t1_i = A_i[i+1][j];
        t2_r = A_r[i][j];
        t2_i = A_i[i][j];
        A_r[i][j] = t1_r;
        A_i[i][j] = t1_i;
        A_r[i+1][j] = t2_r;
        A_i[i+1][j] = t2_i;
      }
    } else if (A_r[i][k] == 0.0 && A_i[i][k] == 0.0) {
      // Data-dependent condition, rarely executed
      ng = sqrt(A_r[i+1][k]*A_r[i+1][k]
              + A_i[i+1][k]*A_i[i+1][k]);
      s_r = A_r[i+1][k] / ng;
      s_i = -A_i[i+1][k] / ng;
      for (j=k; j<N; j++) {
        t1_r = -s_r*A_r[i][j] - s_i*A_i[i][j];
        t1_i = -s_r*A_i[i][j] + s_i*A_r[i][j];
        t2_r = s_r*A_r[i+1][j] - s_i*A_i[i+1][j];
        t2_i = s_r*A_i[i+1][j] + s_i*A_r[i+1][j];
        A_r[i][j] = t1_r;
        A_i[i][j] = t1_i;
        A_r[i+1][j] = t2_r;
        A_i[i+1][j] = t2_i;
      }
    } else {
      // Most frequently executed case
      nm = sqrt(A_r[i][k] * A_r[i][k] + A_i[i][k] * A_i[i][k] +
                A_r[i+1][k] * A_r[i+1][k] + A_i[i+1][k] * A_i[i+1][k]);
      nf = sqrt(A_r[i][k] * A_r[i][k] + A_i[i][k] * A_i[i][k]);
      sig_r = A_r[i][k] / nf;
      sig_i = A_i[i][k] / nf;
      c_r = nf / nm;
      s_r = (sig_r * A_r[i+1][k] + sig_i * A_i[i+1][k]) / nm;
      s_i = (sig_i * A_r[i+1][k] - sig_r * -A_i[i+1][k]) / nm;
      for (j=k; j<N; j++) {
        t1_r = -s_r*A_r[i][j] - s_i*A_i[i][j] + c_r*A_r[i+1][j];
        t1_i = -s_r*A_i[i][j] + s_i*A_r[i][j] + c_r*A_i[i+1][j];
        t2_r = c_r*A_r[i][j] + s_r*A_r[i+1][j] - s_i*A_i[i+1][j];
        t2_i = c_r*A_i[i][j] + s_r*A_i[i+1][j] + s_i*A_r[i+1][j];
        A_r[i][j] = t1_r;
        A_i[i][j] = t1_i;
        A_r[i+1][j] = t2_r;
        A_i[i+1][j] = t2_i;
      }
    }
  }
}



Figure 5. Givens kernel 



sis alone only extracts parallelism on the intermediate i loop, lead- 
ing to a weak 1.5 x speedup on 8 cores. Dynamic analysis amounts 
to speculatively assuming that diagonal elements are not null, hence 
that row permutations will be infrequent. Such a speculative as- 
sumption can be used to enable more aggressive loop transforma- 
tions: it "virtually" eliminates the dependences due to row permu- 
tations, and enables PLUTO to discover the composition of skew- 
ing and tiling we were hoping for. The transformed loop nest fol- 
lows a similar pattern as Givens, with extra conflict detection code. 
It may lead to sequential recomputation of the algorithm on the 
south-western part of the matrix defined by an offending diagonal 
coefficient. 
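The speculate, detect and recover pattern described above can be sketched as follows. This is a deliberately simplified stand-in, under several assumptions: it uses 0-based indices and a row-major array, it parallelizes only the row update instead of applying the skewed and tiled schedule, and on misspeculation it rolls back and re-executes the whole elimination sequentially rather than only the affected part of the matrix. All function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Safe path: forward elimination with the pivot search of Figure 7
 * (0-based indices, row-major n x n matrix a, right-hand side b). */
void forward_elim_safe(double *a, double *b, int n)
{
    for (int k = 0; k < n - 1; k++) {
        if (a[k * n + k] == 0.0) {
            int m = k;
            double amax = 0.0;
            for (int i = k + 1; i < n; i++) {
                double x = a[i * n + k];
                double aabs = x < 0 ? -x : x;
                if (aabs > amax) { amax = aabs; m = i; }
            }
            if (m != k) {                       /* row permutation */
                double t = b[m]; b[m] = b[k]; b[k] = t;
                for (int j = k; j < n; j++) {
                    t = a[k * n + j];
                    a[k * n + j] = a[m * n + j];
                    a[m * n + j] = t;
                }
            }
        }
        for (int i = k + 1; i < n; i++) {
            double xfac = a[i * n + k] / a[k * n + k];
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= xfac * a[k * n + j];
            b[i] -= xfac * b[k];
        }
    }
}

/* Speculative path: assume that no diagonal element is zero. Returns 1 if
 * the assumption held, 0 if the safe path had to be executed instead. */
int forward_elim_speculative(double *a, double *b, int n)
{
    size_t na = (size_t)n * n;
    double *save = malloc((na + n) * sizeof *save);
    if (!save) { forward_elim_safe(a, b, n); return 0; }
    memcpy(save, a, na * sizeof *a);            /* checkpoint for rollback */
    memcpy(save + na, b, (size_t)n * sizeof *b);

    int ok = 1;
    for (int k = 0; k < n - 1; k++) {
        if (a[k * n + k] == 0.0) { ok = 0; break; }   /* conflict detected */
        #pragma omp parallel for
        for (int i = k + 1; i < n; i++) {
            double xfac = a[i * n + k] / a[k * n + k];
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= xfac * a[k * n + j];
            b[i] -= xfac * b[k];
        }
    }
    if (!ok) {                                  /* rollback and recompute */
        memcpy(a, save, na * sizeof *a);
        memcpy(b, save + na, (size_t)n * sizeof *b);
        forward_elim_safe(a, b, n);
    }
    free(save);
    return ok;
}
```

A full checkpoint is the simplest recovery strategy but also the most expensive one; restricting the recomputation to the region invalidated by the offending diagonal coefficient, as the text describes, avoids redoing work that was already correct.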

Figure 8 shows the ideal results on a 10000 × 10000 random matrix
where pivoting is never required. Coarse grain parallelization
and locality optimization through tiling yield a super-linear 10.54×
speedup (both original and transformed versions are automatically
and fully vectorized).



// Skewed and tiled outer loops
for (c0=-1; c0<=min(floord(M-2, 16), floord(N+M-3, 32)); c0++) {
  lb1 = max(max(max(0, ceild(32*c0-M+2, 32)),
                ceild(32*c0-N+1, 32)), ceild(32*c0-31, 64));
  ub1 = min(floord(M-2, 32), floord(32*c0+31, 32));
  // Parallel loop on coarse-grain blocks
  #pragma omp parallel for shared(c0,lb1,ub1) \
          private(c1,c2,c3,c4,cond1,cond2)
  for (c1=lb1; c1<=ub1; c1++) {
    if (c0 <= c1) {
      for (c3=max(0,32*c1); c3<=min(M-2,32*c1+31); c3++) {
        cond1 = (A_r[c3+1][0] == 0.0 && A_i[c3+1][0] == 0.0);
        cond2 = (A_r[c3][0] == 0.0 && A_i[c3][0] == 0.0);
        if (cond1) {
          for (c4=0; c4<=N-1; c4++) {
            t1_r = A_r[c3+1][c4];
            t1_i = A_i[c3+1][c4];
            t2_r = A_r[c3][c4];
            t2_i = A_i[c3][c4];
            A_r[c3][c4] = t1_r;
            A_i[c3][c4] = t1_i;
            A_r[c3+1][c4] = t2_r;
            A_i[c3+1][c4] = t2_i;
          }
        } else if (cond2) {
          ng = sqrt(A_r[c3+1][0]*A_r[c3+1][0]
                  + A_i[c3+1][0]*A_i[c3+1][0]);
          s_r = A_r[c3+1][0] / ng;
          s_i = -A_i[c3+1][0] / ng;
          for (c4=0; c4<=N-1; c4++) {
            t1_r = -s_r*A_r[c3][c4] - s_i*A_i[c3][c4];
            t1_i = -s_r*A_i[c3][c4] + s_i*A_r[c3][c4];
            t2_r = s_r*A_r[c3+1][c4] - s_i*A_i[c3+1][c4];
            t2_i = s_r*A_i[c3+1][c4] + s_i*A_r[c3+1][c4];
            A_r[c3][c4] = t1_r;
            A_i[c3][c4] = t1_i;
            A_r[c3+1][c4] = t2_r;
            A_i[c3+1][c4] = t2_i;
          }
        } else {
          nm = sqrt(A_r[c3][0]*A_r[c3][0]
                  + A_i[c3][0]*A_i[c3][0]
                  + A_r[c3+1][0]*A_r[c3+1][0]
                  + A_i[c3+1][0]*A_i[c3+1][0]);
          nf = sqrt(A_r[c3][0] * A_r[c3][0]
                  + A_i[c3][0] * A_i[c3][0]);
          sig_r = A_r[c3][0] / nf;
          sig_i = A_i[c3][0] / nf;
          c_r = nf/nm;
          s_r = (sig_r*A_r[c3+1][0] + sig_i*A_i[c3+1][0]) / nm;
          s_i = (sig_i*A_r[c3+1][0] - sig_r*A_i[c3+1][0]) / nm;
          for (c4=0; c4<=N-1; c4++) {
            t1_r = -s_r*A_r[c3][c4] - s_i*A_i[c3][c4]
                 + c_r*A_r[c3+1][c4];
            t1_i = -s_r*A_i[c3][c4] + s_i*A_r[c3][c4]
                 + c_r*A_i[c3+1][c4];
            t2_r = c_r*A_r[c3][c4] + s_r*A_r[c3+1][c4]
                 - s_i*A_i[c3+1][c4];
            t2_i = c_r*A_i[c3][c4] + s_r*A_i[c3+1][c4]
                 + s_i*A_r[c3+1][c4];
            A_r[c3][c4] = t1_r;
            A_i[c3][c4] = t1_i;
            A_r[c3+1][c4] = t2_r;
            A_i[c3+1][c4] = t2_i;
          }
        }
      }
      /* And much more */



Figure 6. Optimized Givens kernel (part) 

In practice, the speculation always succeeds in the case of positive
definite matrices.³ Positive definite matrices have other interesting
properties such as being nonsingular, having their largest element
on the diagonal, and having all positive diagonal elements.
No (partial) pivoting is necessary for a strictly column diagonally 



³ A matrix A is positive definite if x^T A x > 0 for all nonzero x.



for (k=1; k<=n-1; ++k)
{
  // Make sure that diagonal element is not null
  // 1st data-dependent condition
  if (a[k][k] == 0)
  {
    amax = abs(a[k][k]);
    m = k;
    // Find the row with largest pivot
    for (i=k+1; i<=n; i++) {
      aabs = abs(a[i][k]);
      // 2nd data-dependent condition
      if (aabs > amax) {
        amax = aabs;
        m = i;
      }
    }
    // Row permutation
    // 3rd data-dependent condition
    if (m != k) {
      swap(b[m], b[k]);
      for (j=k; j<=n; j++) {
        swap(a[k][j], a[m][j]);
      }
    }
  }
  // Update a[][]
  for (i=k+1; i<=n; i++) {
    xfac = a[i][k] / a[k][k];
    for (j=k+1; j<=n; j++) {
      a[i][j] = a[i][j] - xfac*a[k][j];
    }
    b[i] = b[i] - xfac*b[k];
  }
}



Figure 7. Forward reduction step of Gauss-J 
Figure 8. Givens and (ideal) Gauss-J speedups on the 8-core target

dominant matrix when performing Gaussian elimination or LU fac- 
torization. Fortunately, many matrices that arise in finite element 
methods are diagonally dominant: Figure 9 shows the speedups of
Gauss-J running on different matrices of the Harwell-Boeing col- 
lection. Performance variations are due to the matrix size and mis- 
speculations. 
Figure 9. Gauss-J speedups on the 8-core target, with different
Harwell-Boeing matrices 

3. Towards Synergistic Transformations and 
Dynamic Analyses 

The previous experiments show that loop transformations can be 
very profitable on dynamic, data-dependent control flow. Sometimes,
conservative results of static analyses are sufficient to enable
these transformations and achieve scalable parallelism. But
our point is not to oppose static and dynamic methods. It is much 
more interesting to study the impact of dynamic information on 
the effectiveness of loop nest transformations, and to exploit static 
analysis knowledge to focus the dynamic analysis effort. 

The Gauss-J kernel shows that excellent results can be expected 
when operating dynamic analyses (inspection, speculation) and ag- 
gressive loop nest optimizations in synergy. Indeed, we expect that 
the benefits of static and dynamic methods can nurture each other 
in a large number of parallelization and loop transformation prob- 
lems. Under conservative analysis hypotheses, it may be possible 
to transform the control flow to generate more efficient dynamic 
analysis code; the result of these analyses may authorize bolder 
hypotheses on the dependences (speculative or not), which in turn
open the way for more aggressive loop transformations.

Our study is still too preliminary to demonstrate the effective 
profitability of such a synergistic approach on full applications. 
However, it is already possible to sketch the principles of a polyhe- 
dral compilation framework embedding dynamic information into 
its search space construction, and generating inspection, conflict 
detection and/or recovery code automatically (and on demand). 

The polyhedral framework captures three important compo- 
nents of the semantics of a loop nest in a rich, algebraic frame- 
work. These components are the iteration domains (the set of loop 
iterations) of all statements, the access functions for all array refer- 
ences in these statements, and multidimensional scheduling func- 
tions to capture the relative ordering of the statement iterations. 
These three components are represented as systems of affine in- 
equalities (unions of convex polyhedra). Affine transformations are 
pushing their way into production compilers, including GCC [13]
and IBM XL, leveraging two unique advantages: 

• arbitrarily complex compositions of loop transformations can
be represented, while offering a flexible framework to validate
their legality [5];

• well-structured search spaces can be built, allowing the design
of effective heuristics to derive such complex sequences of loop
transformations automatically, addressing the parallelism and
locality interplay of modern architectures [2, 4].
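As a hedged worked instance of the three components, consider the update statement S of Gauss-J in Figure 7, a[i][j] = a[i][j] - xfac*a[k][j]. The symbols below are notation introduced here for illustration, and the schedule is shown in a simplified form that omits the constant dimensions ordering S relative to the other statements of the loop nest.

```latex
\begin{align*}
\mathcal{D}_S &= \{\, (k,i,j) \in \mathbb{Z}^3 \mid 1 \le k \le n-1,\;
                   k+1 \le i \le n,\; k+1 \le j \le n \,\}
  && \text{(iteration domain)} \\
f_{\mathrm{write}}(k,i,j) &= (i,\, j)
  && \text{(access function of } a[i][j]\text{)} \\
f_{\mathrm{read}}(k,i,j)  &= (k,\, j)
  && \text{(access function of } a[k][j]\text{)} \\
\theta_S(k,i,j) &= (k,\, i,\, j)
  && \text{(original schedule)}
\end{align*}
```

All bounds and subscripts are affine in the loop counters and the parameter n, so S fits the representation directly. The row-swapping statements do not, because their execution depends on the run-time values tested by the data-dependent conditions.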

The recent work of Benabderrahmane et al. extends the applicability
of the polyhedral framework to data-dependent control
flow [1], but it still relies on conservative results from static analysis.
There is a clear opportunity to refine the set of affine dependence
constraints defining the search space of affine transformations.
The main challenge is to capture the outcome of the data-dependent
condition of an inspection or conflict detection slice. The
condition itself cannot be precisely characterized statically (other- 
wise there would be no justification for dynamic analysis); but it 
has been generated by a previous compilation pass that can be de- 
signed to retain the causal relation between the outcome of the con- 
dition and the presence of a dependence constraint over a specific 
set of statement instances. 

For example, considering Gauss-J again, a negative outcome of
the first data-dependent condition a[k][k] == 0 guarantees the
absence of any dependence involving the row-swapping statements.
This is the very speculative hypothesis that enabled loop tiling, 
improving locality and reducing synchronization overhead. 

Since dynamic techniques can also benefit from loop transfor- 
mations to become more effective and mitigate their intrinsic over- 
head, it would be ideal to derive the inspection, conflict detection or 
recovery code from the assumptions made in the polyhedral repre- 
sentation itself. For example, a parallelization heuristic may choose 
to weight dependences according to their likelihood to occur at runtime
(based on profile data), and to ignore some of these dependences
when looking for profitable affine transformations. Once a
good candidate composition of loop transformations is found, the 
polyhedral code generator produces not only the transformed (par- 
allel) loop nest, but also the interleaved dynamic analysis code to 
validate the original assumption. This is exactly the principle of 
hybrid analysis by Rus et al. [11], but extended beyond parallelism
detection and towards the validation of arbitrarily complex loop 
nest transformations. 

Considering Gauss-J once again, an expert programmer can
easily guess that the first data-dependent condition has a good 
predictability potential on some relevant classes of matrices, and 
that the second data-dependent condition aabs > amax is very un- 
likely to be a relevant candidate for speculation because it amounts 
to precisely predicting the row of the maximum pivot. A compiler 
looking for speculative execution points may not be able to figure 
this out statically, but it can rely on offline profiling, or multiver- 
sioning and online profiling. Before opting for a more expensive 
speculation strategy, the compiler can leverage static dependence 
information to discover that a lightweight inspection scheme is not 
sufficient to enable loop tiling: the permutation-hampering dependences
would be detected too late, only after the completion of the
i loop of the update part. In addition, the compiler can also use
static dependence information to figure out the actual impact of the 
speculative hypothesis. Speculating on the negative outcome of the 
first condition is sufficient to enable loop tiling, but a highly predictable
condition that does not help refine the dependence constraints
is unlikely to be a good speculation candidate in general.
Both predictability and dependence disambiguation are required:
this is exactly the objective of the sensitivity analysis by Rus et
al. [10], which we would like to revisit in the context of polyhedral
compilation. 
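The twofold requirement can be sketched as a simple filter over candidate conditions. The data structure, the function names and the threshold are hypothetical; in a real compiler the predictability would come from offline or online profiling and the second field from static dependence analysis.

```c
#include <stddef.h>

/* Hypothetical description of one data-dependent condition. */
struct spec_candidate {
    const char *condition;       /* source text, for reporting only */
    double predictability;       /* profiled fraction of expected outcomes */
    int removes_blocking_deps;   /* 1 if assuming the outcome removes the
                                    dependences that prevent the transformation */
};

/* A candidate qualifies only if it is both predictable and useful. */
int is_good_candidate(const struct spec_candidate *c, double min_predictability)
{
    return c->removes_blocking_deps
        && c->predictability >= min_predictability;
}

/* Return the most predictable qualifying candidate, or NULL if none. */
const struct spec_candidate *
pick_candidate(const struct spec_candidate *cands, size_t n,
               double min_predictability)
{
    const struct spec_candidate *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!is_good_candidate(&cands[i], min_predictability))
            continue;
        if (best == NULL || cands[i].predictability > best->predictability)
            best = &cands[i];
    }
    return best;
}
```

With such a filter, a condition that is perfectly predictable but leaves the dependence constraints unchanged is rejected, as is an unpredictable condition whose outcome would otherwise enable tiling.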

4. Conclusion 

This paper does not attempt to be complete, in terms of state-of- 
the-art transformations or dynamic analysis techniques. Our goal is 
to study whether the effectiveness of parallelizing compilers can or 
cannot be improved when blending static and dynamic techniques 
rather than opposing them. Our findings show that there is a strong 
potential in following this path: 

• aggressive loop nest optimizations are required for scalability, 
and it is possible to enable them on data-dependent control- 
flow; 

• it is possible and profitable to leverage dynamic analysis infor- 
mation to enhance the effectiveness and applicability of loop 
transformations. 

We also sketched how to embed dynamic information into affine 
transformation spaces, while synthesizing inspection and/or specu- 
lation code automatically. 

We are working on fully automating these techniques. We also 
plan to extend parallelism detection among acyclic control-flow re- 
gions nested into loop nests, combining affine loop transformations 
with decoupled software pipelining [8].